Key information extraction algorithm of news Web pages
XIANG Jingjing, GENG Guanggang, LI Xiaodong
Journal of Computer Applications    2016, 36 (8): 2082-2086.   DOI: 10.11772/j.issn.1001-9081.2016.08.2082
Abstract
Existing information extraction algorithms for Web pages lack generality and do not extract the title, release time and source of news Web pages. To address these problems, a new information extraction algorithm, newsExtractor, was proposed. First, the HTML code of a Web page was parsed into a set of text entries, each pairing a line number with its text. The extractor then searched for the boundary of the news content starting from the line containing the longest sentence, because the longest sentence belongs to the news content with extremely high probability. Meanwhile, the longest common string algorithm was used to extract the title, regular expressions and line numbers were used to extract the release time, and the presentation characteristics of the source together with line numbers were used to extract the source. Finally, a data set was built to compare extraction accuracy against an open-source software package named newsPaper. Experimental results show that newsExtractor outperforms newsPaper in average accuracy for content, title, release time and source, and that it has strong generality and robustness.
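The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the page has already been reduced to a list of text lines whose list index plays the role of the line number, and every function name, the `min_len` threshold, the boundary rule and the date pattern are hypothetical choices made only for illustration. Source extraction is omitted because the abstract does not say which presentation characteristics are used.

```python
import re
from difflib import SequenceMatcher

# Hypothetical date pattern, e.g. "2016-08-10 09:30"; real pages need more formats.
DATE_RE = re.compile(
    r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?"
)


def content_bounds(lines, min_len=40):
    """Return (start, end) line numbers of the presumed news content.

    The seed is the longest line; the span grows in both directions while
    neighbouring lines are at least min_len characters long (an assumed,
    simplified boundary rule).
    """
    seed = max(range(len(lines)), key=lambda i: len(lines[i]))
    start = end = seed
    while start > 0 and len(lines[start - 1]) >= min_len:
        start -= 1
    while end < len(lines) - 1 and len(lines[end + 1]) >= min_len:
        end += 1
    return start, end


def longest_common_substring(a, b):
    """Longest contiguous run of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def extract_title(page_title, lines, content_start):
    """Pick the longest overlap between the <title> text and any line above the content."""
    best = ""
    for line in lines[:content_start]:
        common = longest_common_substring(page_title, line)
        if len(common) > len(best):
            best = common
    return best.strip()


def extract_release_time(lines, content_start):
    """Scan upward from the content for the nearest date-like string."""
    for i in range(content_start - 1, -1, -1):
        match = DATE_RE.search(lines[i])
        if match:
            return match.group(0)
    return None
```

Starting the boundary search from the longest line avoids any site-specific template, which is the generality argument the abstract makes. Restricting the title and release-time searches to lines above the content is one way the line number can be put to use.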